Exploratory Data Analysis of Movies

starting with: https://www.kaggle.com/somyamaheshwari/data-analysis-and-visualization-plotly

Make the column names uniform

Example, renaming columns (docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html): df.rename(columns={"A": "a", "B": "c"})

Making all column names lower case for ease of use. Can also use: df.rename(str.lower, axis='columns')
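A minimal sketch of the two ways to lower-case column names mentioned above. The toy frame and its column names are made up for illustration; they are not the real dataset.

```python
import pandas as pd

# Toy frame with mixed-case column names (hypothetical, not the real dataset).
df = pd.DataFrame({"Title": ["A", "B"], "IMDb": [8.1, 7.4], "Rotten Tomatoes": [90, 75]})

# Option 1: pass a function to rename; it is applied to every column label
# and returns a new DataFrame.
lowered = df.rename(str.lower, axis="columns")

# Option 2: use the string methods on the column index. Replacing spaces with
# underscores is an extra step (an assumption) that allows df.rotten_tomatoes.
df.columns = df.columns.str.lower().str.replace(" ", "_")
```

Option 2 modifies df in place, while rename leaves the original frame untouched unless the result is assigned back.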

Now for the dataset!

Making a new DF with these counts:

Comparing the top IMDb titles to the top RT titles: not much overlap.
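One way to put a number on "not much overlap" is to intersect the two top-N sets. This is only a sketch: the column names title, imdb and rt, the toy scores, and n = 3 are all assumptions.

```python
import pandas as pd

# Toy data; the scores and column names are invented for illustration.
movie_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb": [9.0, 8.5, 8.0, 6.0, 5.5, 7.0],
    "rt": [60, 95, 70, 90, 85, 50],
})

n = 3  # size of each "top" list (assumed)
top_imdb = set(movie_df.nlargest(n, "imdb")["title"])
top_rt = set(movie_df.nlargest(n, "rt")["title"])

# Titles that appear in both top lists, and the share of the list they make up.
overlap = top_imdb & top_rt
share = len(overlap) / n
```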

Is the above double counting movies that list two countries? No, it doesn't seem so, but it is leaving some of them out.

language

total = movie_df.language.value_counts()
percent = total / movie_df.language.count() * 100
lang_info = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
lang_info

????

turn above into a new table for each with percent each language
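A possible sketch of that per-language table. It assumes the language column can hold several comma-separated languages per movie (as the country and director columns appear to), so each entry is split before counting. The toy data is invented.

```python
import pandas as pd

# Toy data; the comma-separated format is an assumption about the real column.
movie_df = pd.DataFrame(
    {"language": ["English", "English,French", "Hindi", None, "English"]}
)

# One row per (movie, language) pair: drop nulls, split on ',', then explode.
languages = movie_df["language"].dropna().str.split(",").explode().str.strip()

total = languages.value_counts()
# Percent of all language mentions, not of movies.
percent = (total / len(languages) * 100).round(1)

lang_info = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
```

Because a movie with two languages contributes two rows, the percentages sum to 100 over language mentions. Dividing by len(movie_df) instead would give the share of movies that include each language.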

now using https://www.kaggle.com/nikhileshkos/recommended-ott-movies-shows-analysis

sns.pairplot(movie_df)
fig = plt.gcf()
fig.set_size_inches(20, 20)

Dropping every row that has a null in any of these columns: movies.dropna(subset=['Directors', 'Genres', 'Country', 'Language', 'Runtime'], inplace=True)

checking for any duplicate data

movies.drop_duplicates(inplace=True)
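A small sketch of the clean-up steps above, with the duplicates counted before they are dropped so the effect is visible. The toy frame is invented and uses only a few of the columns named in the notes.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
movies = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Directors": ["X", "X", None, "Y"],
    "Runtime": [100, 100, 90, 120],
})

# Drop rows with a null in the columns we care about.
movies = movies.dropna(subset=["Directors"])

# Count fully identical rows before removing them.
n_duplicates = int(movies.duplicated().sum())

# Remove the duplicates and tidy the index.
movies = movies.drop_duplicates().reset_index(drop=True)
```

Assigning the result back is equivalent to inplace=True here and makes each step easier to inspect in a notebook.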

Directors: in this column, some entries list several directors separated by ','. I'll split the names on ',' and stack them one after the other for easy analysis. To find the directors with the most movies, I set a threshold (10) and plotted the directors who directed more than 10 movies.
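The split, stack and threshold steps could look roughly like this. The sketch uses explode in place of an explicit stack, which produces the same one-name-per-row result. The names are invented, and the threshold is lowered to 2 so the toy data yields a match (the notes use 10).

```python
import pandas as pd

# Toy data; director names are hypothetical.
movies = pd.DataFrame({
    "Directors": ["Jane Doe,John Roe", "Jane Doe", "John Roe", "Jane Doe", "Ann Poe"]
})

threshold = 2  # the notes use 10; lowered here for the toy data

# Split on ',' and put each director on its own row.
directors = movies["Directors"].str.split(",").explode().str.strip()

# Movies per director, then keep only those above the threshold.
counts = directors.value_counts()
prolific = counts[counts > threshold]

# prolific can then be passed to a bar chart, e.g. prolific.plot(kind="bar").
```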